We collected data from GitHub, BioConductor and CRAN. In this notebook, we merge these data into a single file that will be used for our analysis. The expected format of the file is:
In [9]:
fields = ['Package', 'Version', 'Source', 'Date', 'License', 'Suggests', 'Imports', 'Depends', 'Owner',
'Repository', 'CommitDate', 'CRANRelease', 'SnapshotFirstDate', 'SnapshotLastDate', 'BiocDate',
'BiocVersion', 'BiocCategory']
OUTPUT = '../data/github-cran-bioc-alldata.csv'
In [10]:
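import pandas
# read_csv replaces the removed DataFrame.from_csv; index_col=0 and parse_dates=True reproduce its defaults
github = pandas.read_csv('../data/github-raw-2015-05-04.csv', index_col=0, parse_dates=True)
cran = pandas.read_csv('../data/cran-deps-history-2015-04-20.csv', index_col=None)
bioc = pandas.read_csv('../data/bioconductor-2015-05-05.csv', index_col=0, parse_dates=True)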
Note: this preprocessing step is no longer applied, and the code is kept commented out for reference. For data coming from GitHub, if a pair (package, version) had several instances, we kept only the oldest one.
In [11]:
# github = github.sort_values('CommitDate')
# github = github.drop_duplicates(['Package', 'Version'], keep='first')
The same may or may not apply to BioConductor; this step is disabled as well.
In [12]:
# bioc = bioc.sort_values('BiocDate')
# bioc = bioc.drop_duplicates(['Package', 'Version'], keep='first')
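For reference, the keep-the-oldest rule described above for GitHub and BioConductor can be sketched on toy data. This is only an illustration: the rows below are made up, and the column names simply mirror those of the GitHub data.

```python
import pandas

# Toy data (made up for illustration): 'foo' 1.0 appears in two commits
toy = pandas.DataFrame({
    'Package': ['foo', 'foo', 'bar'],
    'Version': ['1.0', '1.0', '0.2'],
    'CommitDate': ['2015-03-01', '2015-01-15', '2015-02-10'],
})
toy['CommitDate'] = pandas.to_datetime(toy['CommitDate'])

# Sort chronologically, then keep the first (oldest) row of each (Package, Version) pair
oldest = (toy.sort_values('CommitDate')
             .drop_duplicates(subset=['Package', 'Version'], keep='first'))
print(oldest)
```

Sorting before dropping duplicates is what makes `keep='first'` select the oldest instance rather than whichever row happens to come first in the file.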
The following function parses a dependency field and returns a list of package names.
In [13]:
def parse_dependencies(str_list, ignored=()):
    """
    Return a list of strings where each string is a package name not in `ignored`.
    The input is a list of dependencies as contained in a DESCRIPTION file.
    """
    # Treat missing values (NaN/None) as an empty field. NaN != NaN is always
    # true, so a direct comparison with NaN would never detect them.
    if pandas.isnull(str_list):
        str_list = ''
    # Strip version constraints such as "(>= 1.0)"
    names = [dep.split('(')[0].strip() for dep in str_list.split(',')]
    # Drop empty items and ignored packages
    return [name for name in names if name and name not in ignored]
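To make the parsing rule concrete, here is a minimal standalone sketch of the same splitting logic applied to a made-up Depends field. The helper name `strip_versions` and the sample string are hypothetical and are not part of the notebook's data.

```python
def strip_versions(field, ignored=()):
    """Hypothetical helper: split a DESCRIPTION dependency field into bare package names."""
    # Split on commas and drop any "(>= x.y)" version constraint
    names = [dep.split('(')[0].strip() for dep in field.split(',')]
    # Discard empty items (e.g. from a trailing comma) and ignored names
    return [name for name in names if name and name not in ignored]

# Made-up field, written in the style of a DESCRIPTION file
depends = 'R (>= 2.10), methods, MASS (>= 7.3-0),'

all_names = strip_versions(depends)
without_r = strip_versions(depends, ignored=('R',))
print(' '.join(all_names))
print(' '.join(without_r))
```

Joining the names with spaces gives the flat string format that the merged file stores in the Suggests, Imports and Depends columns.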
We now merge the three datasets into one big dataset, and apply some processing (parse_dependencies).
In [14]:
cran['Source'] = 'cran'
cran['Date'] = cran['SnapshotFirstDate']
github['Source'] = 'github'
github['Date'] = github['CommitDate']
bioc['Source'] = 'bioc'
bioc['Date'] = bioc['BiocDate']
# Merge
packages = pandas.concat([cran, github, bioc])
# Deal with dependencies lists
dependencies_formatter = lambda x: ' '.join(parse_dependencies(x))
for field in ['Suggests', 'Imports', 'Depends']:
    packages[field] = packages[field].fillna(value='').apply(dependencies_formatter)
# Convert date
packages['Date'] = pandas.to_datetime(packages['Date'])
# Remove useless packages (see http://cran.r-project.org/doc/manuals/r-release/R-exts.html#Creating-R-packages)
# The mandatory ‘Package’ field gives the name of the package.
# This should contain only (ASCII) letters, numbers and dot, have at least two characters and
# start with a letter and not end in a dot.
packages = packages.dropna(subset=['Version', 'Package', 'Date'])
packages = packages[packages.Package.str.match(r'^[a-zA-Z][a-zA-Z0-9\.]+$')]
output = packages[fields].sort_values('Package')
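The package-name filter used above can be checked in isolation with the standard `re` module. The sketch below runs the notebook's pattern on a few made-up names. It also shows that this pattern accepts a name ending in a dot, although the quoted rule forbids it; the stricter pattern is an assumption about how the rule could be enforced and is not what the notebook uses.

```python
import re

# Same pattern as the notebook's filter
NAME_PATTERN = re.compile(r'^[a-zA-Z][a-zA-Z0-9\.]+$')
# Assumed stricter variant: additionally rejects names ending in a dot
STRICT_PATTERN = re.compile(r'^[a-zA-Z][a-zA-Z0-9.]*[a-zA-Z0-9]$')

# Made-up candidate names
candidates = ['ggplot2', 'data.table', 'a', '2fast', 'my_pkg', 'trailing.']

valid = [name for name in candidates if NAME_PATTERN.match(name)]
strict_valid = [name for name in candidates if STRICT_PATTERN.match(name)]
print(valid)
print(strict_valid)
```

Both patterns reject single-character names, names starting with a digit, and names containing an underscore; they differ only on the trailing dot.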
In [15]:
output.to_csv(OUTPUT, encoding='utf-8')